It is always a good idea to make your code as readable as possible: not only so that others can pick it up and use it easily, but so that you don't end up hating your past self when you look at something you wrote several months prior. Pipelines are a great tool for integrating and reusing a series of data transformations and fits in a workflow. In this exercise you'll build some pipelines and feature unions using the Concrete Compressive Strength Data Set.
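As a minimal sketch of the idea (the step names and estimator here are placeholders, not part of the exercise):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# toy pipeline: scale the features, then fit a linear model
toy_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression())
])
# toy_pipe.fit(X, y) runs each transform in order before fitting the final estimator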
1 - Head over to the Machine Learning Repository, download the Concrete Compressive Strength Data Set, put it into a dataframe, and split it into training and test sets. Be sure to familiarize yourself with the data before proceeding.
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import resample
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.feature_selection import f_regression, SelectKBest, SelectFromModel
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve, GridSearchCV,\
learning_curve, StratifiedKFold, LeaveOneOut, KFold
from sklearn.metrics import mean_squared_error, confusion_matrix, precision_score, accuracy_score, recall_score,\
f1_score, precision_recall_curve, roc_curve, roc_auc_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression, LassoCV, LogisticRegression, Ridge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
In [17]:
concrete = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls')
concrete.columns = ['cement', 'blast_furnace_slag', 'fly_ash', 'water', 'superplasticizer',
'coarse_aggregate', 'fine_aggregate', 'age', 'concrete_compressive_strength']
In [3]:
concrete.head()
Out[3]:
In [4]:
concrete.info()
In [5]:
concrete.describe().T
Out[5]:
In [18]:
Xtrain, Xtest, ytrain, ytest = train_test_split(concrete.drop('concrete_compressive_strength', axis=1),
concrete.concrete_compressive_strength,
test_size=0.2, random_state=42)
2 - Build a pipeline for polynomial fitting, fit polynomials of degree 1 up to the number of features, and plot your training and testing errors for each. Comment on your results.
In [19]:
# scaling+pol+linear regression
pipe = Pipeline([
('scl', StandardScaler()),
('poly', PolynomialFeatures()),
('lr', LinearRegression())
])
In [22]:
scores = []
# looping through degrees
for n in range(1, Xtrain.shape[1]+1):
    # set pipeline params
    pipe.set_params(poly__degree=n)
    # fit and score
    pipe.fit(Xtrain, ytrain)
    scores.append([n, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])
scores = pd.DataFrame(scores)
scores.columns = ['degree', 'train score', 'test score', 'train mse', 'test mse']
In [23]:
scores
Out[23]:
In [19]:
# plot on two separate axes cause the scales are very different
fig, ax = plt.subplots(2, figsize=(10, 8))
ax[0].plot(scores.degree, scores['train score'], label='train score')
ax[1].plot(scores.degree, scores['test score'], label='test score');
In [25]:
# single plot for n=1,2,3
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.iloc[:3, 0], scores.iloc[:3, 1], label='train score')
ax.plot(scores.iloc[:3, 0], scores.iloc[:3, 2], label='test score');
Increasing the polynomial degree increases the training score, but generalization becomes very poor (with a really strange spike for the fifth-degree polynomial). The best test score is for the third degree (84%).
3 - Build a pipeline that will perform feature selection on the dataset using the F-Statistic, producing the same plots as in part (2). Comment on your results.
In [26]:
# scaling, select k best and linear regression
pipe = Pipeline([
('scl', StandardScaler()),
('best', SelectKBest(f_regression)),
('lr', LinearRegression())
])
In [28]:
scores = []
# looping over number of features
for n in range(1, Xtrain.shape[1]+1):
    # set pipe params
    pipe.set_params(best__k=n)
    # fit and score
    pipe.fit(Xtrain, ytrain)
    scores.append([n, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])
scores = pd.DataFrame(scores)
scores.columns = ['n_features', 'train score', 'test score', 'train mse', 'test mse']
In [29]:
scores
Out[29]:
In [30]:
# train and test on a single plot
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.n_features, scores['train score'], label='train score')
ax.plot(scores.n_features, scores['test score'], label='test score')
ax.legend();
Beyond the sixth feature there is little improvement in the results.
4 - Build a pipeline that standardizes your data, performs feature selection via regularization, and then fits a model of your choice. Produce the same plots as above and comment on your results.
In [37]:
# scaling, SelectFromModel using LassoCV, then a final regressor (random forest, as in the solution)
pipe = Pipeline([
('scl', StandardScaler()),
('sfmod', SelectFromModel(LassoCV())),
# ('lasso', LassoCV())
('forest', RandomForestRegressor()) # used in solution
])
In [38]:
scores = []
# looping through threshold values
for c in np.arange(0.1, 2.1, 0.1):
    pipe.set_params(sfmod__threshold=str(c) + '*mean')
    pipe.fit(Xtrain, ytrain)
    scores.append([c, pipe.score(Xtrain, ytrain), pipe.score(Xtest, ytest),
                   mean_squared_error(pipe.predict(Xtrain), ytrain), mean_squared_error(pipe.predict(Xtest), ytest)])
    # sel = pipe.named_steps['sfmod']
    # print(Xtrain.columns[sel.transform(np.arange(len(Xtrain.columns)).reshape(1, -1))])
scores = pd.DataFrame(scores)
scores.columns = ['threshold', 'train score', 'test score', 'train mse', 'test mse']
In [39]:
scores
Out[39]:
In [40]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.threshold, scores['train score'], label='train score')
ax.plot(scores.threshold, scores['test score'], label='test score')
ax.legend();
In [41]:
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(scores.threshold, scores['train mse'], label='train mse')
ax.plot(scores.threshold, scores['test mse'], label='test mse')
ax.legend();
With LassoCV as the final estimator, the test score is better than the training score; using the random forest we get more overfitting, but also better overall performance than all the previous methods. The best choice for the threshold seems to be around 0.5.
5 - Create two pipelines for feature selection with a technique of your choice, scaling the data beforehand. Then, join these two pipelines with a FeatureUnion, and fit a polynomial model, also in a pipeline. Comment on your results.
In [42]:
# joining select from model using lasso and select k best using the best parameters from before
pipe = Pipeline([
('scl', StandardScaler()),
('featsel', FeatureUnion([
('sfmod', SelectFromModel(LassoCV(), threshold='0.5*mean')),
('best', SelectKBest(k=6))
])),
('poly', PolynomialFeatures(degree=3)),
('lr', LinearRegression())
])
In [43]:
pipe.fit(Xtrain, ytrain)
print(pipe.score(Xtrain, ytrain))
print(pipe.score(Xtest, ytest))
In [45]:
print(mean_squared_error(pipe.predict(Xtrain), ytrain))
print(mean_squared_error(pipe.predict(Xtest), ytest))
In [44]:
pipe.named_steps['poly'].n_input_features_
Out[44]:
By joining the pipelines I get the features selected by both methods (which is a somewhat strange model), but the test score of 81% is quite good (still worse, though, than the 84% obtained with polynomial features alone).
It is very important to have more than one tool in your toolbox for evaluating the usefulness of your model, since in different contexts different metrics are preferred. For example, suppose a new medical test is developed for detecting cancer that has a 0.25 probability of incorrectly labeling a patient as having cancer when they in fact do not, but only a 0.001 probability of labeling a cancer patient as cancer free. With this sort of test, you can be sure that those who do have cancer will almost certainly be classified correctly, but a positive result does not necessarily mean that the patient has cancer, so additional tests are in order. These metrics have different names and, depending on the situation, you may be interested in minimizing different quantities, which is the topic we will explore in this exercise.
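To make the trade-off concrete, here is a quick back-of-the-envelope calculation with those numbers; the 1% prevalence is an assumption of mine, purely for illustration:
# hypothetical test: FPR = 0.25, FNR = 0.001, assumed prevalence = 1%
prevalence = 0.01                  # assumed, not given in the exercise
tpr = 1 - 0.001                    # sensitivity
fpr = 0.25                         # fall-out
p_positive = tpr * prevalence + fpr * (1 - prevalence)
precision = tpr * prevalence / p_positive
print(precision)                   # ~0.04: most positive results are false alarms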
1 - Head over to the Machine Learning Repository and download the Breast Cancer Diagnostic Dataset, put it in a dataframe, and split it into testing and training sets.
In [46]:
breast = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
header=None)
breast.columns = ['id','clump_thickness','uniformity_cell_size','uniformity_cell_shape','marginal_adhesion',
'single_epithelial_cell_size','bare_nuclei','bland_chromatin','normal_nucleoli','mitoses','class']
In [4]:
breast.info()
bare_nuclei is an object column: this is due to the presence of '?' for missing values, which I'm going to replace with 1s.
In [5]:
breast.bare_nuclei.value_counts()
Out[5]:
In [47]:
breast.bare_nuclei = breast.bare_nuclei.apply(lambda x: x.replace('?', '1'))
breast.bare_nuclei = pd.to_numeric(breast.bare_nuclei)
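Just as a hedged alternative (not what is done above): treat '?' as a genuine missing value and impute with the column median; the alt frame below is only for illustration.
# sketch of an alternative: coerce '?' to NaN and impute with the median (not applied here)
alt = breast.copy()
alt.bare_nuclei = pd.to_numeric(alt.bare_nuclei, errors='coerce')
alt.bare_nuclei = alt.bare_nuclei.fillna(alt.bare_nuclei.median())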
Also, I'm encoding the class column with 0s and 1s:
In [48]:
# 2 for benign, 4 for malignant
le = LabelEncoder()
le.fit([2, 4])
breast['class'] = le.transform(breast['class'])
In [21]:
breast.describe().T
Out[21]:
In [49]:
Xtrain, Xtest, ytrain, ytest = train_test_split(breast.drop('class', axis=1), breast['class'], test_size=0.2, random_state=0)
2 - Using a classification algorithm of your choice, fit a model to the data and predict the results, making your results as accurate as possible.
In [50]:
# scale and KNN
pipe = Pipeline([
('scl', StandardScaler()),
('knn', KNeighborsClassifier())
])
In [51]:
scores = []
# looping through number of neighbors
for k in range(1, 20):
    pipe.set_params(knn__n_neighbors=k)
    temp = cross_val_score(estimator=pipe, X=Xtrain, y=ytrain, cv=10)
    scores.append([k, np.mean(temp), np.std(temp)])
scores = pd.DataFrame(scores)
scores.columns = ['k', 'kfold mean', 'kfold std']
In [52]:
scores.sort_values(by='kfold mean', ascending=False)
Out[52]:
Seven is the number of neighbors with the best CV score:
In [53]:
pipe.set_params(knn__n_neighbors=7)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[53]:
The accuracy is almost 98%, which is quite good!
3 - Using your model in part (2), compute the following quantities, without using sklearn.metrics.
- True Positives
- True Negatives
- False Positives
- False Negatives
In [27]:
ypred = pipe.predict(Xtest)
In [28]:
TP = sum((ypred[ytest==1]==ytest.values[ytest.values==1]))
TN = sum((ypred[ytest==0]==ytest.values[ytest.values==0]))
FP = sum((ypred[ytest==0]!=ytest.values[ytest.values==0]))
FN = sum((ypred[ytest==1]!=ytest.values[ytest.values==1]))
In [29]:
print('True Positives: {}\nTrue Negatives: {}\nFalse Positives: {}\nFalse Negatives: {}'.format(TP, TN, FP, FN))
4 - Using your results in part (3), compute the following quantities (the standard definitions are recalled after the list).
- Sensitivity, recall, hit rate, or true positive rate (TPR)
- Specificity or true negative rate (TNR)
- Miss rate or false negative rate (FNR)
- Fall-out or false positive rate (FPR)
- Precision
- F1
- Accuracy
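For reference, the standard definitions in terms of the confusion-matrix counts are
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP}, \qquad \mathrm{FNR} = 1 - \mathrm{TPR}, \qquad \mathrm{FPR} = 1 - \mathrm{TNR},$$
$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad F_1 = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{TPR}}{\mathrm{PRE} + \mathrm{TPR}}, \qquad \mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$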
In [30]:
TPR = TP / (FN+TP)
FPR = FP / (FP+TN)
TNR = 1 - FPR
FNR = 1 - TPR
PRE = TP / (TP+FP)
F1 = 2*PRE*TPR / (PRE+TPR)
ACC = (TP+TN) / len(ypred)
In [31]:
print('TPR: {}\nTNR: {}\nFNR: {}\nFPR: {}\nPRE: {}\nF1: {}\nACC: {}'.format(TPR, TNR, FNR, FPR, PRE, F1, ACC))
5 - Check your results in part (4) using sklearn.
In [32]:
print(confusion_matrix(y_true=ytest, y_pred=ypred))
print('TPR', recall_score(ytest, ypred))
print('PRE', precision_score(y_true=ytest, y_pred=ypred))
print('F1', f1_score(y_true=ytest, y_pred=ypred))
print('ACC', accuracy_score(y_true=ytest, y_pred=ypred))
6 - Plot the precision and recall curve for your fit.
In [34]:
ypredprob = pipe.predict_proba(Xtest)
In [44]:
# get precision and recall values
precision, recall, _ = precision_recall_curve(ytest, ypredprob[:, 1])
# plot them and fill the curve
plt.figure(figsize=(16, 8))
plt.step(recall, precision, color='b', alpha=0.2, where='post')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve');
7 - Plot the ROC curve for your fit.
In [49]:
# getting FPR and TPR for the ROC curve and the area under it
fpr, tpr, _ = roc_curve(ytest, ypredprob[:, 1])
roc_auc = roc_auc_score(ytest, ypredprob[:, 1])
# plot and fill the curve
plt.figure(figsize=(16, 8))
plt.step(fpr, tpr, color='b', alpha=0.2, where='post')
plt.fill_between(fpr, tpr, step='post', alpha=0.2, color='b')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('ROC curve, area = {:.5f}'.format(roc_auc));
A problem that you will see crop up time and time again in Data Science is overfitting. Much as people can sometimes see structure where there is none, machine learning algorithms can suffer from the same problem. If you have overfit your model to your data, it has learned a "pattern" in the noise rather than the signal you were looking for, and thus will not generalize well to data it has not seen.
Consider the LASSO regression model, which we have used previously. As with all parametric models, fitting a LASSO regression reduces to finding a set of $\hat{\theta}_i$ that minimize the cost function given the data. But notice that, unlike standard linear regression, LASSO regression has an additional regularization parameter, so the $\hat{\theta}_i$ depend not only on our data but also on this additional hyperparameter.
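For concreteness, one common formulation of that cost function (conventions for the scaling factor vary) is
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{T}\theta\right)^2 + \alpha\sum_{j=1}^{p}\left|\theta_j\right|,$$
where $\alpha$ is the regularization hyperparameter that has to be tuned alongside the $\hat{\theta}_i$.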
So now we have three different problems to juggle while we are fitting our models: overfitting, underfitting, and hyperparameter tuning. A common technique to deal with all three is cross-validation which we will explore in this exercise.
Also, you may find yourself in a situation where you do not have enough data to build a good model. Again, with some simple assumptions, we can "pull ourselves up by our bootstraps", and get by reasonably well with the data we have using a method called bootstrapping.
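As a small sketch of the bootstrap idea (the sample here is synthetic and the numbers arbitrary): resample the data with replacement many times and look at the spread of the statistic of interest across the resamples.
rng = np.random.RandomState(0)
sample = rng.normal(loc=10, scale=2, size=50)                        # a small 'observed' sample
boot_means = [resample(sample, random_state=i).mean() for i in range(1000)]
print(np.mean(boot_means), np.std(boot_means))                       # bootstrap mean and standard-error estimate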
In this exercise, you'll explore these topics using the wine dataset.
1 - Head over to the Machine Learning Repository, download the Wine Dataset, and put it in a dataframe, being sure to label the columns properly.
In [54]:
wine = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
# wine = pd.read_csv('wine.data', header=None)
wine.columns = ['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity_ash', 'magnesium',
'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity',
'hue', 'OD280_OD315', 'proline']
In [4]:
wine.head()
Out[4]:
In [5]:
wine.info()
In [6]:
wine.describe().T
Out[6]:
In [47]:
# Xtrain, Xtest, ytrain, ytest = train_test_split(wine.drop('class', axis=1),
# wine['class'],
# test_size=0.2,
# random_state=5,
# stratify=wine['class'])
2 - Separate your data into train and test sets, with portions ranging from test_size = 0.99 to test_size = 0.01, fitting a logistic regression model to each and computing the training and test errors. Plot the errors of the training and test sets on the same plot. Do this without using sklearn.model_selection.learning_curve. Comment on your results.
In [55]:
# scaling and logistic regression
pipe = Pipeline([
('scl', StandardScaler()),
('lr', LogisticRegression())
])
scores = []
# looping through test sizes from 0.01 to 0.98 (0.99 gave an error because not all classes were present in the split)
# for tsize in np.arange(0.98, 0.01, -0.01):
for tsize in np.arange(0.01, 0.99, 0.01):
    # splitting
    Xtr, Xts, ytr, yts = train_test_split(wine.drop('class', axis=1),
                                          wine['class'],
                                          test_size=tsize,
                                          random_state=1,
                                          # stratify=wine['class'] # not used in the solution
                                          )
    # fit and score
    pipe.fit(Xtr, ytr)
    scores.append([tsize, pipe.score(Xtr, ytr), pipe.score(Xts, yts)])
scores = np.array(scores)
In [56]:
# plot train and test scores by increasing train size (1-test size)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(1 - scores[:, 0], scores[:, 1], label='Train')
ax.plot(1 - scores[:, 0], scores[:, 2], label='Test')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
ax.set_xlabel('Train size (%)')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();
As expected, the test accuracy goes up as the training size increases; somewhat strangely, the accuracy on the training set is always 100%.
The solution doesn't scale the data and gets worse results; let's try that out:
In [57]:
lr = LogisticRegression()
scores = []
for tsize in np.arange(0.01, 0.99, 0.01):
    # splitting
    Xtr, Xts, ytr, yts = train_test_split(wine.drop('class', axis=1),
                                          wine['class'],
                                          test_size=tsize,
                                          random_state=1,
                                          )
    # fit and score
    lr.fit(Xtr, ytr)
    scores.append([tsize, lr.score(Xtr, ytr), lr.score(Xts, yts)])
scores = np.array(scores)
In [58]:
# plot train and test scores by increasing train size (1-test size)
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(1 - scores[:, 0], scores[:, 1], label='Train')
ax.plot(1 - scores[:, 0], scores[:, 2], label='Test')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])
ax.set_xlabel('Train size (%)')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();
The results are still a bit different from the solution's...
3 - Repeat part (2), but this time using sklearn.model_selection.learning_curve. Comment on your results.
In [77]:
train_sizes, train_scores, test_scores = learning_curve(pipe,
wine.drop('class', axis=1),
wine['class'],
train_sizes=np.arange(0.35, 0.99, 0.01),
cv=10,
random_state=5)
In [130]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(train_sizes, train_scores.mean(axis=1), label='Train')
ax.plot(train_sizes, test_scores.mean(axis=1), label='Test')
ax.set_ylim([0.4, 1.05])
ax.set_xlabel('Train size')
ax.set_ylabel('Accuracy')
ax.set_title('Learning Curves')
ax.legend();
Using learning_curve, the accuracy on the training set is always perfect, while the test accuracy increases without reaching 100%; this may be because learning_curve uses cross-validation, so we are less dependent on a single split.
4 - Use K-Fold cross validation to tune the regularization strength parameter of the logistic regression. Do this for values of k ranging from 2 to 5, make relevant plots, and comment on your results. Do this without using sklearn.model_selection.validation_curve.
In [146]:
# using grid search which uses k-fold cv by default
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'lr__C': param_range}]
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid)
scores = {}
best_scores = {}
# looping through number of folds
for k in [2, 3, 4, 5]:
    # set params
    gs.set_params(cv=k)
    # fit
    gs.fit(wine.drop('class', axis=1), wine['class'])
    # saving best scores and mean scores for all params
    best_scores[k] = [gs.best_params_['lr__C'], gs.best_score_]
    scores[k] = np.array([param_range, gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score']]).T
In [155]:
best_scores
Out[155]:
In [148]:
# creating 4 plots, one for each number of fold
fig, ax = plt.subplots(2, 2, figsize=(16, 8), sharex='col', sharey='row', subplot_kw=dict(xscale='log'))
i, j = 0, 0
for k in scores:
    # plotting train and test scores and the best score
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 1], label='Train')
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 2], label='Test')
    ax[i, j].plot(best_scores[k][0], best_scores[k][1], 'ro', label='Best Score')
    ax[i, j].set_title('{}-fold CV'.format(k))
    if j == 0:
        ax[i, j].set_ylabel('Accuracy')
    if i == 1:
        ax[i, j].set_xlabel('C value')
    ax[i, j].legend()
    j += 1
    if j == 2:
        j = 0
        i += 1
The training scores are very similar in each of the plots; there are some differences between the test accuracy curves, and the best value for C seems to move towards 1.
In [63]:
# solution:
X = wine.drop('class', axis=1)
y = wine['class']
k_vals = np.arange(2, 6, 1)
c_vals = np.arange(0.5, 1.5, 0.05)
err = []
for k in k_vals:
    for c in c_vals:
        lm = LogisticRegression(C=c)
        temp_err = cross_val_score(lm, X=X, y=y, cv=k)
        err_dict = {}
        err_dict['k'] = k
        err_dict['c'] = c
        err_dict['acc'] = temp_err.mean()
        err.append(err_dict)
err = pd.DataFrame(err)
In [64]:
for k in err.k.unique():
    plt.plot(c_vals, err.acc[err.k == k], label='k = {}'.format(k))
plt.title('Accuracy vs Regularization Strength')
plt.xlabel('C')
plt.ylabel('Accuracy')
plt.legend();
5 - Tune the regularization strength parameter as in part (4), but this time using sklearn.model_selection.validation_curve. Comment on your results.
In [142]:
scores = {}
best_scores = {}
for k in [2, 3, 4, 5]:
    # validation_curve uses CV internally and takes parameters similar to grid search
    train_scores, test_scores = validation_curve(estimator=pipe,
                                                 X=wine.drop('class', axis=1),
                                                 y=wine['class'],
                                                 param_name='lr__C',
                                                 param_range=param_range,
                                                 cv=k)
    # saving mean scores
    mean_train_scores = train_scores.mean(axis=1)
    mean_test_scores = test_scores.mean(axis=1)
    # and best scores
    best_scores[k] = [param_range[mean_test_scores.argmax()], mean_test_scores[mean_test_scores.argmax()]]
    scores[k] = np.array([param_range, mean_train_scores, mean_test_scores]).T
In [154]:
best_scores
Out[154]:
In [143]:
fig, ax = plt.subplots(2, 2, figsize=(16, 8), sharex='col', sharey='row', subplot_kw=dict(xscale='log'))
i, j = 0, 0
for k in scores:
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 1], label='Train')
    ax[i, j].plot(scores[k][:, 0], scores[k][:, 2], label='Test')
    ax[i, j].plot(best_scores[k][0], best_scores[k][1], 'ro', label='Best Score')
    ax[i, j].set_title('{}-fold CV'.format(k))
    if j == 0:
        ax[i, j].set_ylabel('Accuracy')
    if i == 1:
        ax[i, j].set_xlabel('C value')
    ax[i, j].legend()
    j += 1
    if j == 2:
        j = 0
        i += 1
The plots are the same as above.
6 - Fit another classification algorithm to the data, tuning the parameter using LOOCV. Comment on your results.
In [153]:
pipe = Pipeline([
('scl', StandardScaler()),
('svm', SVC())
])
# using grid search with LeaveOneOut as cv
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'svm__C': param_range}]
# in solution: cross_val_score(pipe, X, y, cv=LeaveOneOut()) used looping through params values
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
cv=LeaveOneOut())
gs.fit(wine.drop('class', axis=1), wine['class'])
gs.best_params_['svm__C'], gs.best_score_
Out[153]:
Using LOOCV I get 0.3 as the best choice for C and an accuracy of almost 99%, which is slightly higher than before.
7 - Suppose that the wine data we received was incomplete, containing, say, only 20% of the full set, but due to a fast-approaching deadline we still need to compute some statistics and fit a model. Use bootstrapping to compute the mean and variance of the features, and fit the classification model you used in part (4), comparing and commenting on your results with those from the full dataset.
In [6]:
# resampling the data to 5 times its original size
wine_strapped = resample(wine, n_samples=len(wine)*5, random_state=10)
In [67]:
# solution: leaves 80% of the data out and bootstraps the remaining 20%:
data_sub_test, data_sub = train_test_split(wine, test_size=0.2, random_state=0)
data_boot = data_sub.sample(n=150, replace=True, random_state=0)
In [69]:
# still from solution:
means = {'original': wine.drop('class', axis=1).mean(),
'20%': data_sub.drop('class', axis=1).mean(),
'bootstrap%': data_boot.drop('class', axis=1).mean()}
var = {'original': wine.drop('class', axis=1).var(),
'20%': data_sub.drop('class', axis=1).var(),
'bootstrap%': data_boot.drop('class', axis=1).var()}
means = pd.DataFrame(means)
var = pd.DataFrame(var)
In [70]:
means
Out[70]:
In [72]:
var
Out[72]:
In [79]:
# last from solution:
pipe = Pipeline([
('scl', StandardScaler()),
('lr', LogisticRegression())
])
pipe.set_params(lr__C=1)
pipe.fit(wine.drop('class', axis=1), wine['class'])
print(cross_val_score(pipe, wine.drop('class', axis=1), wine['class'], cv=5).mean())
pipe.fit(data_boot.drop('class', axis=1), data_boot['class'])
print(cross_val_score(pipe, data_sub_test.drop('class', axis=1), data_sub_test['class'], cv=5).mean())
In [7]:
wine.describe().T
Out[7]:
In [8]:
wine_strapped.describe().T
Out[8]:
In [9]:
pipe = Pipeline([
('scl', StandardScaler()),
('lr', LogisticRegression())
])
# using grid search to fit the best logistic regression
param_range = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
param_grid = [{'lr__C': param_range}]
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid)
scores = {}
best_scores = {}
# looping through number of folds
for k in [2, 3, 4, 5]:
    gs.set_params(cv=k)
    gs.fit(wine_strapped.drop('class', axis=1), wine_strapped['class'])
    best_scores[k] = [gs.best_params_['lr__C'], gs.best_score_]
    scores[k] = np.array([param_range, gs.cv_results_['mean_train_score'], gs.cv_results_['mean_test_score']]).T
In [10]:
best_scores
Out[10]:
In all the cases I get 100% accuracy! (This is overly optimistic: the bootstrapped set contains duplicated rows, so the same observations can end up in both the training and validation folds of the cross-validation.)
Everything covered in the previous sections is part of the much broader topic known as model selection. Model selection is not only about choosing the right machine learning algorithm to use, but also about tuning parameters while keeping overfitting and scalability issues in mind. In this exercise, we'll build models for a couple of different datasets, using all of the concepts you've worked with previously.
1 - Head over to the Machine Learning Repository, download the Mushroom Data Set, and put it into a dataframe. Be sure to familiarize yourself with the data before proceeding. Break the data into training and testing sets. Be sure to keep this testing set the same for the duration of this exercise, as we will be using it to test various algorithms!
In [80]:
shrooms = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', header=None)
shrooms.columns = ['poisonous', 'cap_shape', 'cap_surface', 'cap_color', 'bruises', 'odor', 'gill_attachment', 'gill_spacing',
'gill_size', 'gill_color', 'stalk_shape', 'stalk_root', 'stalk_surface_above_ring',
'stalk_surface_below_ring', 'stalk_color_above_ring', 'stalk_color_below_ring', 'veil_type',
'veil_color', 'ring_number', 'ring_type', 'spore_print_color', 'population', 'habitat']
In [49]:
shrooms.head()
Out[49]:
In [50]:
shrooms.info()
In [51]:
shrooms.describe().T
Out[51]:
We have 2480 missing values for stalk_root (denoted by '?'), which I'm going to keep as they are, since '?' is just another category here.
I'm dropping veil_type instead, because it has only one category.
In [81]:
shrooms.drop('veil_type', axis=1, inplace=True)
In [82]:
Xtrain, Xtest, ytrain, ytest = train_test_split(shrooms.drop('poisonous', axis=1),
shrooms.poisonous,
test_size=0.2,
random_state=42)
Also, I'm relabeling the poisonous mushrooms as 1 and the non-poisonous ones as 0.
In [83]:
ytrain = ytrain.apply(lambda x: 1 if x == 'p' else 0)
ytest = ytest.apply(lambda x: 1 if x == 'p' else 0)
2 - Fit a machine learning algorithm of your choice to the data, tuning hyperparameters using the Better Holdout Method. Generate training and validation plots and comment on your results.
In [ ]:
# in the solution, all variables are encoded at once using df.apply(le.fit_transform)
This is a class to perform label encoding on multiple columns:
In [84]:
class MultiLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    # def fit_transform(self, X, y=None):
    #     return self.fit(X, y).transform(X)
    # def get_params(self):
    #     return 'columns'
    # def set_params(self, param, value):
    #     self.param = value
In [85]:
# encode, one hot encode and then KNN
pipe = Pipeline([
('mle', MultiLabelEncoder()),
('ohe', OneHotEncoder()),
('knn', KNeighborsClassifier())
])
In [57]:
# better holdout
Xtr, Xcv, ytr, ycv = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=42)
scores = []
# looping through various neighbors number
for k in range(1, 21):
    pipe.set_params(knn__n_neighbors=k)
    pipe.fit(Xtr, ytr)
    tr_score = pipe.score(Xtr, ytr)
    cv_score = pipe.score(Xcv, ycv)
    scores.append([k, tr_score, cv_score])
scores = np.array(scores)
In [58]:
scores
Out[58]:
In [59]:
fig, ax = plt.subplots(figsize=(16, 8))
# plotting train and holdout scores
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='Hold Out')
ax.set_title('KNN with Different # of Neighbors'.format(k))
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Neighbors')
ax.legend();
In [60]:
pipe.set_params(knn__n_neighbors=5)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[60]:
The best results are for a number of neighbors between 2 and 7 (100% accuracy on both train and test set), so I'm going with the default of 5.
3 - Repeat part (2), this time using cross validation. Comment on your results.
In [86]:
# from solution:
train_scores, test_scores = validation_curve(pipe, Xtrain, ytrain, param_name='knn__n_neighbors', param_range=range(1, 21))
In [89]:
# solution continues:
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.plot(range(1, 21), train_scores_mean, label='Train')
plt.fill_between(range(1, 21), train_scores_mean - train_scores_std, train_scores_mean + train_scores_std)
plt.plot(range(1, 21), test_scores_mean, label='Test')
plt.fill_between(range(1, 21), test_scores_mean - test_scores_std, test_scores_mean + test_scores_std)
plt.legend();
In [61]:
scores = []
# same as above, only using KFold
for k in range(1, 11):
    # recreate the folds on every iteration: KFold.split returns a generator, which is exhausted after one pass
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(knn__n_neighbors=k)
    print('{} Neighbors'.format(k))
    # looping through the folds
    for i, (train, test) in enumerate(kfold):
        # creating train and cv sets
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        # fit and score
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        # append fold scores
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    # append mean and std score of the folds
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [62]:
scores
Out[62]:
In [63]:
fig, ax = plt.subplots(figsize=(16, 8))
# plotting train and cv mean scores and filling with std
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title('KNN with Different # of Neighbors'.format(k))
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Neighbors')
ax.legend();
We get perfect accuracy for a number of neighbors between 2 and 8 this time; we can also see that we have no variance for this range of values.
4 - Repeat part (3) using a different machine learning algorithm. Comment on your results.
In [64]:
# encoder and random forest
pipe = Pipeline([
('mle', MultiLabelEncoder()),
#('ohe', OneHotEncoder()),
('forest', RandomForestClassifier())
])
In [65]:
scores = []
for k in np.arange(2, 50, 2):
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(forest__n_estimators=k)
    print('{} Trees'.format(k))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [66]:
scores
Out[66]:
In [67]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title('Random Forest with different # of Trees'.format(k))
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Trees')
ax.legend();
In [68]:
pipe.set_params(forest__n_estimators=10)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[68]:
We get perfect accuracy for about 8 trees or more, so I'm going with the default value of 10.
5 - Whichever of your two algorithms in parts (3) and (4) performed more poorly, perform variable selection to see if you can improve your results.
In [70]:
mle = MultiLabelEncoder()
pca = PCA()
pca.fit(mle.fit_transform(Xtrain), ytrain)
Out[70]:
In [71]:
pca.explained_variance_ratio_
Out[71]:
Let's try 7 components:
In [72]:
pipe = Pipeline([
('mle', MultiLabelEncoder()),
('sel', PCA(n_components=7)),
('forest', RandomForestClassifier())
])
In [74]:
scores = []
for k in np.arange(2, 22, 2):
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(forest__n_estimators=k)
    print('{} Trees'.format(k))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    scores.append([k, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [75]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title('Random Forest with different # of Trees'.format(k))
ax.set_ylabel('Accuracy')
ax.set_xlabel('# of Trees')
ax.legend();
Well it's actually worse!
6 - Pick a classification algorithm that has at least two hyperparameters and tune them using GridSearchCV. Comment on your results.
I'm trying logistic regression + RFE, tuning the number of selected features and the regularization strength:
In the solution, a random forest is used, tuning n_estimators and max_features.
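For reference, a minimal sketch of what that random-forest grid search might look like; the parameter grid below is my own guess, not the solution's:
rf_pipe = Pipeline([
    ('mle', MultiLabelEncoder()),
    ('forest', RandomForestClassifier(random_state=0))
])
# hypothetical grid over the number of trees and the features considered per split
rf_grid = {'forest__n_estimators': [10, 30, 100],
           'forest__max_features': ['sqrt', 'log2', None]}
rf_gs = GridSearchCV(rf_pipe, param_grid=rf_grid, cv=5)
rf_gs.fit(Xtrain, ytrain)
print(rf_gs.best_params_, rf_gs.best_score_)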
In [8]:
from sklearn.feature_selection import RFE
Apparently GridSearchCV gets stuck if I use custom scorers or multithreading (or something like that), so I'm going to encode my labels outside the pipeline...
In [9]:
mle = MultiLabelEncoder()
Xtrain_enc = mle.fit_transform(Xtrain)
Xtest_enc = mle.transform(Xtest)
In [13]:
pipe = Pipeline([
('sel', RFE(estimator=LogisticRegression())),
('lr', LogisticRegression())
])
# C_range = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300]
C_range = [0.01, 0.03, 0.1, 0.3, 1, 3]
comp_range = np.arange(1, 16)
# using grid search to tune regularization and the number of selected features
param_grid = [{'sel__n_features_to_select':comp_range, 'lr__C': C_range}]
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
cv=10)
# this could take a very long time...
gs.fit(Xtrain_enc, ytrain)
Out[13]:
In [14]:
print(gs.best_score_)
print(gs.best_params_)
In [16]:
pipe.set_params(sel__n_features_to_select=13, lr__C=0.3)
pipe.fit(Xtrain_enc, ytrain)
pipe.score(Xtest_enc, ytest)
Out[16]:
We get 95% accuracy with simple logistic regression, not bad!
1 - Head over to the Machine Learning Repository, download the Parkinson's Telemonitoring Data Set, and put it into a dataframe. Be sure to familiarize yourself with the data before proceeding, removing the columns related to time, age, and sex. We're going to be predicting motor_UPDRS, so drop the total_UPDRS variable as well. Break the data into training and testing sets. Be sure to keep this testing set the same for the duration of this exercise, as we will be using it to test various algorithms!
In [90]:
parkinson = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/telemonitoring/parkinsons_updrs.data')
parkinson.drop(['age', 'sex', 'test_time', 'total_UPDRS'], axis=1, inplace=True)
In the solution, dummies are used for subject#, which I don't think is really "honest".
In [4]:
parkinson.head()
Out[4]:
In [5]:
parkinson.info()
In [6]:
parkinson.describe().T
Out[6]:
In [91]:
# I'm dropping subject# because it is the patient id
Xtrain, Xtest, ytrain, ytest = train_test_split(parkinson.drop(['subject#', 'motor_UPDRS'], axis=1),
parkinson.motor_UPDRS,
test_size=0.2,
random_state=54)
2 - Fit a machine learning algorithm of your choice to the data, tuning hyperparameters using the Better Holdout Method. Generate training and validation plots and comment on your results.
In [92]:
# scaling and ridge regression
pipe = Pipeline([
('stsc', StandardScaler()),
('ridge', Ridge())
])
In [93]:
# better hold out
Xtr, Xcv, ytr, ycv = train_test_split(Xtrain, ytrain, test_size=0.2, random_state=42)
scores = []
# looping through regularization values
for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    pipe.set_params(ridge__alpha=a)
    pipe.fit(Xtr, ytr)
    tr_score = pipe.score(Xtr, ytr)
    cv_score = pipe.score(Xcv, ycv)
    scores.append([a, tr_score, cv_score])
scores = np.array(scores)
In [94]:
scores
Out[94]:
In [95]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='Hold Out')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();
Let's look at the residuals, which don't seem so randomly scattered:
In [97]:
ypred = pipe.predict(Xtrain)
residuals = ypred - ytrain
plt.figure(figsize=(16, 8))
plt.scatter(ypred, residuals)
plt.hlines(y=0, xmin=ypred.min(), xmax=ypred.max(), lw=2, color='red');
In [98]:
pipe.set_params(ridge__alpha=100)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[98]:
The score is very bad; the best choice for the regularization parameter seems to be 100, but there seems to be a serious bias problem on this dataset. We may need more features to get accurate predictions.
3 - Repeat part (2), this time using cross validation. Comment on your results.
Again, the solution uses validation_curve...
In [14]:
scores = []
for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(ridge__alpha=a)
    print('alpha = {}'.format(a))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    scores.append([a, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [15]:
scores
Out[15]:
In [16]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();
The two curves are even closer, but again overall performance is very poor. This plot confirms that we have high bias on this dataset.
4 - Repeat part (3) using a different machine learning algorithm. Comment on your results.
In [104]:
# scaling and random forest
pipe = Pipeline([
('scl', StandardScaler()),
('forest', RandomForestRegressor())
])
In [105]:
scores = []
# looping through different number of trees
for n in [3, 10, 30, 100, 300]:
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(forest__n_estimators=n)
    print('# trees = {}'.format(n))
    # looping through folds
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    scores.append([n, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [106]:
scores
Out[106]:
In [107]:
fig, ax = plt.subplots(figsize=(16, 8))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title('Random Forest Regression with Different # of Trees')
ax.set_ylabel('$R^2$')
ax.set_xlabel('# of Trees')
ax.legend();
In [108]:
pipe.set_params(forest__n_estimators=100)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[108]:
Using a random forest we get better performance, but in this case we have a serious problem of high variance.
5 - Whichever of your two algorithms in parts (3) and (4) performed more poorly, perform variable selection to see if you can improve your results.
I'm going to use PCA to try to improve the Ridge regression, but I don't think there is much to gain...
The solution uses LassoCV with SelectFromModel:
In [101]:
scl = StandardScaler()
sfm = SelectFromModel(LassoCV())
Xtrain_sc = scl.fit_transform(Xtrain)
Xtest_sc = scl.transform(Xtest)
Xtrain_lm = sfm.fit_transform(Xtrain_sc, ytrain)
Xtest_lm = sfm.transform(Xtest_sc)
train_scores, test_scores = validation_curve(Ridge(), Xtrain_lm, ytrain,
param_name='alpha',
param_range=[0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000])
In [103]:
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))
ax.plot(alphas, train_scores_mean, label='Train')
ax.fill_between(alphas, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.2)
ax.plot(alphas, test_scores_mean, label='Test')
ax.fill_between(alphas, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.2)
ax.legend();
In [21]:
pca = PCA()
pca.fit(Xtrain)
pca.explained_variance_ratio_
Out[21]:
I guess we can try with 12 components
In [22]:
pipe = Pipeline([
('scl', StandardScaler()),
('pca', PCA(n_components=12)),
('ridge', Ridge())
])
In [23]:
scores = []
for a in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000, 3000]:
    kfold = KFold(n_splits=10, random_state=1).split(Xtrain, ytrain)
    fold_scores = []
    pipe.set_params(ridge__alpha=a)
    print('alpha = {}'.format(a))
    for i, (train, test) in enumerate(kfold):
        Xtr, ytr = Xtrain.iloc[train, :], ytrain.iloc[train]
        Xcv, ycv = Xtrain.iloc[test, :], ytrain.iloc[test]
        pipe.fit(Xtr, ytr)
        tr_score = pipe.score(Xtr, ytr)
        cv_score = pipe.score(Xcv, ycv)
        fold_scores.append([i, tr_score, cv_score])
        print('Fold {} done'.format(i+1))
    fold_scores = np.array(fold_scores)
    scores.append([a, fold_scores.mean(axis=0)[1], fold_scores.mean(axis=0)[2],
                   fold_scores.std(axis=0)[1], fold_scores.std(axis=0)[2]])
scores = np.array(scores)
In [24]:
scores
Out[24]:
In [25]:
fig, ax = plt.subplots(figsize=(16, 8), subplot_kw=dict(xscale='log'))
ax.plot(scores[:, 0], scores[:, 1], label='Train')
ax.plot(scores[:, 0], scores[:, 2], label='CV')
ax.fill_between(scores[:, 0],
scores[:, 1] + scores[:, 3],
scores[:, 1] - scores[:, 3],
alpha=0.15, color='blue')
ax.fill_between(scores[:, 0],
scores[:, 2] + scores[:, 4],
scores[:, 2] - scores[:, 4],
alpha=0.15, color='orange')
ax.set_title(r'Ridge Regression with different $\alpha$')
ax.set_ylabel('$R^2$')
ax.set_xlabel(r'$\alpha$')
ax.legend();
Not much of an improvement...
6 - Pick a regression algorithm that has at least two hyperparameters and tune them using GridSearchCV. Comment on your results.
I'm going to try and add some polynomial features to the Ridge regression to see if it can perform better:
In [44]:
pipe = Pipeline([
('scl', StandardScaler()),
('pca', PCA(n_components=9)),
('poly', PolynomialFeatures()),
('ridge', Ridge())
])
alpha_range = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30]
degree_range = np.arange(1, 4)
# using grid search to tune regularization and degree of polynomials
param_grid = [{'poly__degree':degree_range, 'ridge__alpha': alpha_range}]
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
cv=10,
n_jobs=2)
gs.fit(Xtrain, ytrain)
Out[44]:
In [46]:
print(gs.best_score_)
print(gs.best_params_)
In [47]:
pipe.set_params(poly__degree=2, ridge__alpha=30)
pipe.fit(Xtrain, ytrain)
pipe.score(Xtest, ytest)
Out[47]:
We have a slight improvement introducing polynomial features, but the score is still very low...
In the solution, using a random forest and tuning n_estimators and max_features, he gets 0.357.